Abstract. Support Vector Machines (SVMs) are well-established Machine
Learning (ML) algorithms. They rely on the fact that i) linear
learning can be formalized as a well-posed optimization problem; ii)
non-linear learning can be brought into linear learning thanks to the kernel
trick and the mapping of the initial search space onto a high dimensional
feature space. The kernel is designed by the ML expert and it governs the
efficiency of the SVM approach. In this paper, a new approach for the
automatic design of kernels by Genetic Programming, called the Evolutionary
Kernel Machine (EKM), is presented. EKM combines a well-founded
fitness function inspired by the margin criterion, and a co-evolution
framework ensuring the computational scalability of the approach.
Empirical validation on standard ML benchmarks demonstrates that EKM is
competitive with state-of-the-art SVMs with tuned hyper-parameters.

1 Introduction 

Kernel methods, including the so-called Support Vector Machines (SVMs), are 
well-established learning approaches with both strong theoretical foundations 
and successful practical applications [1]. SVMs rely on two main advances in
statistical learning. First, the linear supervised machine learning task is set as
a well-posed (quadratic) optimization problem. Second, the above setting is
extended to non-linear learning via the kernel trick: given a (manually designed)
change of representation $\Phi$ mapping the initial space onto the so-called feature
space, linear hypotheses are characterized in terms of the scalar product in the 
feature space, or kernel. These hypotheses correspond to non-linear hypotheses 
in the initial space. Although many specific kernels have been proposed in the 
literature, designing a kernel well suited for an application domain or a dataset 
so far remains an art more than a science. 

This paper proposes a system, the Evolutionary Kernel Machine (EKM), for
the automatic design of data-specific kernels. EKM applies Genetic Programming
(GP) [2] to construct symmetric functions (kernels), and optimizes a fitness
function inspired by the margin criterion [3]. Kernels are assessed within a
Nearest Neighbor classification process [4,5]. In order to cope with computational
complexity, a cooperative co-evolution governs the prototype subset selection
and the GP kernel design, while the fitness case subset selection undergoes a
competitive co-evolution.

The paper is organized as follows. Section 2 introduces the formal background
and notations on kernel methods. Sections 3 and 4 respectively describe the
GP representation and the fitness function proposed for the EKM. Scalability
issues are addressed in the co-evolutionary framework introduced in Section 5.
Results on benchmark problems are given in Section 6. Finally, related works
are discussed in Section 7 before concluding the paper in Section 8.

2 Formal Background and Notations 

Supervised machine learning takes as input a dataset
$\mathcal{E} = \{(x_i, y_i), i = 1 \ldots n, x_i \in X, y_i \in Y\}$, made of n
examples; $x_i$ and $y_i$ respectively stand for the description and the label of
the i-th example. The goal is to construct a hypothesis h(x) mapping X onto Y
with minimal generalization error. Only vectorial domains ($X = \mathbb{R}^d$) are
considered throughout this paper; further, only binary classification problems
($Y = \{1, -1\}$) are considered in the rest of this section.

Due to space limitations, the reader is referred to [6] for a comprehensive 
presentation of SVMs. In the simplest (linearly separable) case, the hyper-plane
h(x) maximizing the geometrical margin (distance to the closest examples) is
constructed. The label associated to example x is the sign of h(x), with:

$$h(x) = \sum_{i} \alpha_i \langle x, x_i \rangle + b$$

where $\langle x, x_i \rangle$ denotes the scalar product of x and $x_i$. Let $\Phi$
denote a mapping from the instance space X onto the feature space and let the
kernel K(x, x') be defined as:

$$K : X \times X \to \mathbb{R}; \quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

Under some conditions (the kernel trick), non-linear classifiers on X are
constructed as in the linear case, and characterized as
$h(x) = \sum_i \alpha_i K(x, x_i) + b$.

Besides SVMs, the kernel trick can be used to revisit all learning methods
involving a distance measure. In this paper, the kernel nearest neighbor
(Kernel-NN) algorithm [5], which revisits the k-nearest neighbors (k-NN) [4], is
considered. Given a distance (or dissimilarity) function d(x, x') defined on the
instance space X, given a set of labelled examples
$\mathcal{E} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and an instance x to be
classified, the k-NN algorithm: i) determines the k examples closest to x
according to d(x, x'); ii) outputs the majority class of these k examples.
Kernel-NN proceeds as k-NN, where the distance $d_K(x, x')$ is defined after the
kernel K(x, x') (more on this in Section 4).



Standard kernels on $X = \mathbb{R}^d$ include Gaussian and polynomial kernels
(see footnote 3). It must be noted that the addition, multiplication and
composition of kernels are kernels, and therefore the standard SVM machinery can
find the optimal value of hyper-parameters (e.g. $\sigma$, c or k) among a finite
set. In contrast, to the best of our knowledge, the functional (symbolic)
optimization of K(x, x') can only be tackled by Genetic Programming.

3 Genetic Programming of Kernels 

The Evolutionary Kernel Machine applies GP to determine symmetric functions
K(x, x') on $\mathbb{R}^d \times \mathbb{R}^d$ best suited to the dataset at hand.
As shown in Table 1, the main difference compared to standard symbolic regression
is that terminals are symmetric expressions of x and x' (e.g. $x_i + x'_i$, or
$x_i x'_j + x_j x'_i$), enforcing the symmetry of the kernels (K(x, x') = K(x', x)).
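
To illustrate why symmetric terminals are enough to make the whole tree symmetric, the following Python sketch implements a few of the Table 1 primitives and checks that a composed expression is invariant under swapping x and x'. The helper names and the particular tree in `example_kernel` are made up for illustration; they are not an evolved kernel reported in this paper.

```python
import math

def protected_div(a, b):
    """Protected division (Table 1): returns 1 when |b| < 0.001."""
    return 1.0 if abs(b) < 0.001 else a / b

# Symmetric terminals: each value is unchanged when x and xp are swapped.
def A(i):            # add the i-th components
    return lambda x, xp: x[i] + xp[i]

def M(i):            # multiply the i-th components
    return lambda x, xp: x[i] * xp[i]

def S(i):            # maximum of the i-th components
    return lambda x, xp: max(x[i], xp[i])

def I(i):            # minimum of the i-th components
    return lambda x, xp: min(x[i], xp[i])

def C(i, j):         # crossed multiplication-addition
    return lambda x, xp: x[i] * xp[j] + x[j] * xp[i]

def EUC(x, xp):      # Euclidean distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xp)))

def example_kernel(x, xp):
    """Hypothetical tree: DIV(ADD2(M_0, C_{0,1}), POW2(EUC))."""
    numerator = M(0)(x, xp) + C(0, 1)(x, xp)
    denominator = EUC(x, xp) ** 2
    return protected_div(numerator, denominator)

x, xp = [1.0, 2.0], [3.0, 5.0]
# Any function of symmetric terminals is itself symmetric.
print(example_kernel(x, xp), example_kernel(xp, x))
```

Because the internal nodes only combine values that are already symmetric, no constraint on the function set is needed; symmetry is decided entirely at the leaves.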

The initialization of GP individuals is done using a ramped half-and-half
procedure [2]. The selection probability of terminals $A_i$, $M_i$, $I_i$ and $S_i$
(respectively $C_{i,j}$) is multiplied by 1/d (resp. 2/(d(d+1))), where d is the
dimension of the initial instance space ($X = \mathbb{R}^d$), since each indexed
terminal comes in d variants (resp. d(d+1)/2 variants).

Admittedly, the kernel functions built from Table 1 might not satisfy Mercer's
condition (positive semi-definiteness of K) required for SVM optimization [6].
However, these kernels are assessed with a Kernel-NN classification rule [5];
therefore the fact that they are not necessarily positive is not a limitation.
Quite the contrary, EKM kernels can achieve feature selection; typically,
terminals associated to non-informative features should disappear along
evolution. The use of EKM for feature selection will be examined in future work.

4 Fitness Measure 

Every kernel K(x, x') is assessed after the Kernel-NN classification rule, using
the dissimilarity $d_K$ defined as

$$d_K(x, x')^2 = K(x, x) + K(x', x') - 2K(x, x')$$
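
When K is a scalar product in feature space, this expression is simply the expansion of the squared feature-space distance between the images of x and x'. The Python sketch below shows how a Kernel-NN prediction could be computed from any kernel given as a callable. The function names and the toy data are assumptions made for illustration, and `dot_kernel` is only a stand-in for an evolved kernel.

```python
from collections import Counter

def kernel_dissimilarity_sq(K, x, xp):
    """Squared kernel-induced dissimilarity d_K(x, x')^2."""
    return K(x, x) + K(xp, xp) - 2.0 * K(x, xp)

def kernel_nn_predict(K, prototypes, x, k=1):
    """Majority label among the k prototypes closest to x under d_K.

    `prototypes` is a list of (description, label) pairs. Ranking uses the
    squared dissimilarity directly: this preserves the order of d_K when the
    values are non-negative, and stays defined when a non-Mercer kernel
    yields negative values (an illustrative choice, not taken from the paper).
    """
    ranked = sorted(prototypes,
                    key=lambda p: kernel_dissimilarity_sq(K, x, p[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def dot_kernel(x, xp):
    """Plain scalar product, for which d_K is the Euclidean distance."""
    return sum(a * b for a, b in zip(x, xp))

prototypes = [((0.0, 0.0), -1), ((0.0, 1.0), -1),
              ((3.0, 4.0), 1), ((4.0, 4.0), 1)]
print(kernel_dissimilarity_sq(dot_kernel, (0.0, 0.0), (3.0, 4.0)))
print(kernel_nn_predict(dot_kernel, prototypes, (3.5, 3.0), k=3))
```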

Given a prototype set $\mathcal{E}_p = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$
and a training example e = (x, y), let us assume that $\mathcal{E}_p$ is ordered
by increasing dissimilarity to x ($d_K(x, x_i) \le d_K(x, x_{i+1})$). Let p(e)
denote the minimum rank over all prototype examples in the same class as e
($p(e) = \min\{i, y_i = y, i = 1 \ldots \ell\}$); let n(e) denote the minimum
rank over all other prototype examples (not belonging to the same class as e,
$n(e) = \min\{i, y_i \ne y, i = 1 \ldots \ell\}$).

As noted by [3], the quality of the Kernel-NN classification of e can be
assessed from $\delta_K(e) = n(e) - p(e)$. The higher $\delta_K(e)$, the more
confident the classification of e is, e.g. with respect to perturbations of
$\mathcal{E}_p$ or $d_K$; $\delta_K(e)$ measures the margin of e with respect to
Kernel-NN.

3 Respectively $K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$ and $K(x, x') = (\langle x, x' \rangle + c)^k$.



Table 1. GP primitives involved in the kernel functions K(x, x'), $x, x' \in \mathbb{R}^d$.

Name                             # args.  Description
-------------------------------  -------  -----------
ADD2                             2        Addition of two values, $f_{ADD2}(a_1, a_2) = a_1 + a_2$.
ADD3                             3        Addition of three values, $f_{ADD3}(a_1, a_2, a_3) = a_1 + a_2 + a_3$.
ADD4                             4        Addition of four values, $f_{ADD4}(a_1, a_2, a_3, a_4) = a_1 + a_2 + a_3 + a_4$.
SUB                              2        Subtraction, $f_{SUB}(a_1, a_2) = a_1 - a_2$.
MUL2                             2        Multiplication of two values, $f_{MUL2}(a_1, a_2) = a_1 a_2$.
MUL3                             3        Multiplication of three values, $f_{MUL3}(a_1, a_2, a_3) = a_1 a_2 a_3$.
MUL4                             4        Multiplication of four values, $f_{MUL4}(a_1, a_2, a_3, a_4) = a_1 a_2 a_3 a_4$.
DIV                              2        Protected division, $f_{DIV}(a_1, a_2) = 1$ if $|a_2| < 0.001$, $a_1 / a_2$ otherwise.
MAX                              2        Maximum value, $f_{MAX}(a_1, a_2) = \max(a_1, a_2)$.
MIN                              2        Minimum value, $f_{MIN}(a_1, a_2) = \min(a_1, a_2)$.
EXP                              1        Exponential value, $f_{EXP}(a) = \exp(a)$.
POW2                             1        Square power, $f_{POW2}(a) = a^2$.
A_i, i = 1...d                   0        Add the i-th components, $x_i + x'_i$.
M_i, i = 1...d                   0        Multiply the i-th components, $x_i x'_i$.
S_i, i = 1...d                   0        Maximum between the i-th components, $\max(x_i, x'_i)$.
I_i, i = 1...d                   0        Minimum between the i-th components, $\min(x_i, x'_i)$.
C_{i,j}, i = 1...d, j = 1...i    0        Crossed multiplication-addition between the i-th and j-th components, $x_i x'_j + x_j x'_i$.
DOT                              0        Scalar product of x and x', $\langle x, x' \rangle$.
EUC                              0        Euclidean distance of x and x', $\|x - x'\|$.
E                                0        Ephemeral random constants, generated uniformly in [-1, 1].


Accordingly, given a prototype set
$\mathcal{E}_p = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ and a fitness case
subset $\mathcal{E}_s = \{(x'_1, y'_1), \ldots, (x'_m, y'_m)\}$, the fitness
function associated to K(x, x') is defined as

$$F(K) = \frac{1}{m \ell} \sum_{j=1}^{m} \delta_K\big((x'_j, y'_j)\big)$$

The computation of F has linear complexity in the number $\ell$ of prototypes
and in the number m of fitness cases. In a standard setting, $\mathcal{E}_p$ and
$\mathcal{E}_s$ both coincide with the whole training set $\mathcal{E}$
($\ell = m = n$). However, the resulting quadratic complexity of the fitness
computation with respect to the number n of training examples is incompatible
with the scalability of the approach.
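
As a rough illustration, the rank-based margin of Section 4 and a fitness built from it can be sketched in Python as follows. The sketch assumes that the fitness is the sum of the margins over the fitness cases divided by the product of the two subset sizes (this normalization is an assumption here), that every class occurs in the prototype set, and that the dissimilarity is passed as a callable `d`; all names are illustrative.

```python
def margin(d, prototypes, example):
    """Rank margin n(e) - p(e) of example e = (x, y) w.r.t. the prototypes.

    Prototypes are ranked (1-based) by increasing dissimilarity to x.
    Sorting costs O(l log l); it is used here for brevity, whereas a single
    linear scan would be enough to obtain the two ranks.
    """
    x, y = example
    ranked = sorted(prototypes, key=lambda p: d(x, p[0]))
    labels = [label for _, label in ranked]
    p_rank = min(i for i, yi in enumerate(labels, start=1) if yi == y)
    n_rank = min(i for i, yi in enumerate(labels, start=1) if yi != y)
    return n_rank - p_rank

def fitness(d, prototypes, fitness_cases):
    """Sum of margins over the fitness cases, divided by m * l (assumed)."""
    l, m = len(prototypes), len(fitness_cases)
    return sum(margin(d, prototypes, e) for e in fitness_cases) / (m * l)

# Toy 1-D data; absolute difference stands in for the dissimilarity d_K.
d = lambda a, b: abs(a - b)
prototypes = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (5.0, "b")]
cases = [(0.4, "a"), (2.6, "a")]
print([margin(d, prototypes, e) for e in cases])
print(fitness(d, prototypes, cases))
```

A positive margin means the nearest prototype belongs to the right class, and a larger value means more same-class prototypes precede the first prototype of another class.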

5 Tractability Through Co-evolution 

EKM scalability is obtained along two directions, by i) reducing the number
$\ell$ of prototypes used for classification, and ii) reducing the size m of the
fitness case subset considered during each generation.



Parameters:

  p : GP kernels population size;
  $\ell$ : Size of the prototype subset individuals;
  m : Size of the fitness case subset individuals;
  $\lambda_p$ : Number of offspring in the prototype species;
  $\lambda_s$ : Number of offspring in the fitness case species;
  $p_p$ : Fraction of the prototype subset individual replaced by mutation;
  $p_s$ : Fraction of the fitness case subset individual replaced by mutation.

1. $\mathcal{E}_p^0$: initial prototype subset, stratified uniform sample of size $\ell$ from $\mathcal{E}$;

2. $\mathcal{E}_s^0$: initial fitness case subset, stratified uniform sample of size m from $\mathcal{E}$;

3. $GP^0$: initial population of GP kernels, $\{h_i^0, i = 1 \ldots p\}$;

4. Loop, for t = 1 ... T:

   (a) Apply selection and variation operators to the $GP^{t-1}$ kernel population,
       constructing $GP^t = \{h_i^t, i = 1 \ldots p\}$;

   (b) Compute the fitness $F(h_i^t)$, i = 1 ... p, with prototype subset
       $\mathcal{E}_p^{t-1}$ and fitness case subset $\mathcal{E}_s^{t-1}$; let
       $h^{t,*}$ denote the best one;

   (c) Generate $\lambda_p$ offspring of $\mathcal{E}_p^{t-1}$, by replacing a
       fraction $p_p$ of the prototypes (uniform stratified sampling); assess
       these offspring after $h^{t,*}$ and $\mathcal{E}_s^{t-1}$; set
       $\mathcal{E}_p^t$ to the best offspring;

   (d) Generate $\lambda_s$ offspring of $\mathcal{E}_s^{t-1}$, by replacing a
       fraction $p_s$ of the fitness cases (uniform stratified sampling); assess
       these offspring after $h^{t,*}$ and $\mathcal{E}_p^t$; set
       $\mathcal{E}_s^t$ to the best offspring.

5. Output $h^{**}$, selected among $h^{t,*}$, t = 1 ... T, as the one minimizing
   the 1-NN error rate on the whole training set $\mathcal{E}$ using the
   associated $\mathcal{E}_p^t$ prototype subset.

Fig. 1. The Evolutionary Kernel Machine: a co-evolution framework
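
Steps 1, 2, 4(c) and 4(d) of Figure 1 rely on stratified sampling and on a stratified replacement of part of a subset. A minimal Python sketch of these operations is given below, under the following assumptions: subsets are lists of indices into the training set, class quotas are obtained by largest-remainder rounding, and the replacement keeps the per-class counts of the parent. Function names and the `score` callable are hypothetical and do not come from the Open BEAGLE implementation.

```python
import random
from collections import Counter, defaultdict

def class_quotas(labels, size):
    """Number of examples to draw per class (largest-remainder rounding)."""
    counts, n = Counter(labels), len(labels)
    exact = {c: size * cnt / n for c, cnt in counts.items()}
    quotas = {c: int(v) for c, v in exact.items()}
    leftover = size - sum(quotas.values())
    by_fraction = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in by_fraction[:leftover]:
        quotas[c] += 1
    return quotas

def stratified_sample(labels, size, rng):
    """Distinct indices whose class distribution follows that of `labels`."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    sample = []
    for c, q in class_quotas(labels, size).items():
        sample.extend(rng.sample(by_class[c], q))
    return sample

def mutate_subset(subset, labels, fraction, rng):
    """Replace about `fraction` of each class in `subset` by unused examples."""
    members_of = defaultdict(list)
    for idx in subset:
        members_of[labels[idx]].append(idx)
    used, child = set(subset), []
    for c, members in members_of.items():
        pool = [i for i, y in enumerate(labels) if y == c and i not in used]
        n_swap = min(round(fraction * len(members)), len(pool))
        child.extend(rng.sample(members, len(members) - n_swap))
        child.extend(rng.sample(pool, n_swap))
    return child

def one_comma_lambda_step(parent, labels, fraction, n_offspring, score, rng):
    """(1, lambda) step: the best offspring replaces the parent."""
    offspring = [mutate_subset(parent, labels, fraction, rng)
                 for _ in range(n_offspring)]
    return max(offspring, key=score)

rng = random.Random(0)
labels = ["a"] * 60 + ["b"] * 40
parent = stratified_sample(labels, 10, rng)
child = mutate_subset(parent, labels, 0.5, rng)
print(Counter(labels[i] for i in parent), Counter(labels[i] for i in child))
```

For the prototype species, `score` would be the fitness of the current best kernel computed with the candidate prototypes; for the fitness case species it would be the negated fitness, so that harder subsets are preferred.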



More precisely, a co-evolutionary framework involving three species is
considered, as detailed in Figure 1. The first species includes the GP kernels.
The second species includes the prototype subsets (fixed-size subsets of the
training set), subject to a cooperative co-evolution [7] with the GP kernels. The
third species includes the fitness case subsets (fixed-size subsets of the
training set), subject to a competitive host-parasite co-evolution [8] with the
GP kernels.

The prototype species is evolved to find good prototypes, i.e. prototypes that
maximize the fitness of the GP kernels. The fitness case species is evolved to
find hard and challenging examples, i.e. examples that minimize the kernel
fitness. Of course, there is a danger that the fitness case subset ultimately
captures the noisy examples, as observed in the boosting framework [9] (see
Section 6.2).

Both the prototype and fitness case species are initialized using a stratified
uniform sampling with no replacement (the class distribution in the sample is the
same as in the whole dataset and all examples are distinct). Both species are
evolved using a (1, $\lambda$) evolution strategy; in each generation, $\lambda$
offspring are generated using a uniform stratified replacement of a given
fraction of the parent subset, and assessed after the best kernel in the current
kernel population. The parent subset is replaced by the best offspring. In each
generation, the kernels are assessed after the current prototype and fitness case
individuals.

Table 2. UCI data sets used for the experiments.

Data set  Size  # of features  # of classes  Application domain
--------  ----  -------------  ------------  ------------------
bcw       683   9              2             Wisconsin breast cancer, 65% benign and 35% malignant.
bld       345   6              2             BUPA liver disorders, 58% with disorders and 42% without disorder.
bos       508   13             3             Boston housing, 34% with median value v < 18.77 K$, 33% with v in ]18.77, 23.74], and 33% with v > 23.74.
cmc       1473  9              3             Contraceptive method choice, 43% not using contraception, 35% using short-term contraception, and 23% using long-term contraception.
ion       351   34             2             Ionosphere radar signal, 36% without structure detected and 64% with a structure detected.
pid       768   8              2             Pima Indians diabetes, 65% tested negative and 35% tested positive for diabetes.

6 Experimental Validation 

This section reports on the experimental validation of EKM, on a standard set of
benchmark problems [10], detailed in Table 2. The system is implemented using
the Open BEAGLE framework (see footnote 4) for evolutionary computation [11].

6.1 Experimental Setting 

The parameters used in EKM are reported in Table 3. The average evolution
time for one run is less than one hour (AMD Athlon 2800+).

On each problem, EKM has been evaluated along the standard 10-fold cross 
validation methodology. The whole data set is partitioned into 10 (stratified) 
subsets; the training set is made of all subsets but one; the best hypothesis 
learned from this training set is evaluated on the remaining subset, or test set. 
The accuracy is averaged over the 10 folds (as the test set ranges over the 10 
subsets of the whole dataset); for each fold, EKM is launched 10 times; the 5 best 
hypotheses (after their accuracy on the training set) are assessed on the test set; 
the reported accuracy is the average over the 10 folds of these 5 best hypotheses 
on the test set. In total, EKM is launched 100 times on each problem. 
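
The aggregation described above (per fold, keep the runs with the best training error, average their test errors, then average over folds) can be sketched in Python as follows. The function name and the nested-list layout of the error arrays are assumptions made for illustration; the toy numbers are not experimental results.

```python
def best_half_test_error(train_err, test_err, n_best=5):
    """Average test error of the n_best runs per fold, averaged over folds.

    train_err[f][r] and test_err[f][r] hold the training and test error of
    run r on fold f. Runs are ranked by training error only, so the test set
    never influences which hypotheses are retained.
    """
    fold_means = []
    for tr, te in zip(train_err, test_err):
        best_runs = sorted(range(len(tr)), key=lambda r: tr[r])[:n_best]
        fold_means.append(sum(te[r] for r in best_runs) / len(best_runs))
    return sum(fold_means) / len(fold_means)

# Toy example with 2 folds and 4 runs per fold, keeping the 2 best runs.
train = [[0.1, 0.3, 0.2, 0.4], [0.5, 0.1, 0.2, 0.9]]
test = [[0.2, 0.9, 0.4, 0.9], [0.8, 0.1, 0.3, 0.7]]
print(best_half_test_error(train, test, n_best=2))
```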

EKM is compared to state-of-the-art algorithms, including k-nearest neighbor
and SVMs with Gaussian kernels, similarly assessed using 10-fold cross
validation. For k-NN, the underlying distance is the Euclidean one, and a scaling
normalization option has been considered; the k parameter has been varied in
{1, 3, 5}; the best setting has been kept. For Gaussian SVMs, the Torch3
implementation has been used [12]: the error cost (parameter C) has been varied in
$\{10^i, i = -3 \ldots 4\}$, the $\sigma$ parameter is set to 10, and the best
setting has been similarly retained.

4 http://beagle.gel.ulaval.ca

Table 3. Evolution parameters.

Parameter                Description and parameter values
-----------------------  --------------------------------
GP kernel functions evolution parameters
Primitives               See Table 1.
GP population size       One population of p = 1000 individuals.
Stop criterion           Evolution ends after T = 100 generations.
Replacement strategy     Genetic operations applied following a generational scheme.
Selection                Lexicographic parsimony pressure tournament selection with 7 participants.
Crossover                Classical subtree crossover [2] (prob. 0.7).
Standard mutation        Crossover with a random individual (prob. 0.1).
Swap node mutation       Exchange a primitive with another of the same arity (prob. 0.1).
Shrink mutation          Replace a branch with one of its children and remove the branch mutated and the other children subtrees (if any) (prob. 0.1).

Prototype subset selection parameters
Prototype subset size    $\ell$ = 50 examples in a prototype subset.
Number of offspring      $\lambda_p$ = 4 offspring per generation.
Mutation rate            $p_p$ = 25% of the prototype examples replaced in each mutation.

Fitness case subset selection parameters
Fitness case subset size m = 100 examples in a fitness case subset.
Number of offspring      $\lambda_s$ = 2 offspring per generation.
Mutation rate            $p_s$ = 50% of the selection examples replaced in each mutation.

6.2 Results 

Table 4 shows the results obtained by EKM compared with k-NN and Gaussian
SVM, together with the optimal parameters for the latter algorithms. The size
of the best GP kernel (mean size column) shows that no bloat occurred, thanks to
the lexicographic parsimony pressure. Each algorithm is shown to be the best
performing on half or more of the tested datasets, with frequent ties according
to a paired Student's t-test.
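
The lexicographic parsimony pressure tournament listed in Table 3 compares participants on fitness first and uses tree size only to break ties, which is what keeps code growth in check. A minimal Python sketch follows; drawing the participants with replacement, and the `fitness` and `size` accessors, are assumptions made for illustration rather than details taken from the Open BEAGLE implementation.

```python
import random

def lexicographic_best(contestants, fitness, size):
    """Highest fitness wins; among equal fitness, the smallest tree wins."""
    return max(contestants, key=lambda ind: (fitness(ind), -size(ind)))

def parsimony_tournament(population, fitness, size, n_participants, rng):
    """Draw participants uniformly (with replacement) and return the winner."""
    contestants = [rng.choice(population) for _ in range(n_participants)]
    return lexicographic_best(contestants, fitness, size)

# Toy individuals: (name, fitness, number of nodes).
population = [("a", 0.9, 20), ("b", 0.9, 12), ("c", 0.8, 3)]
get_fitness = lambda ind: ind[1]
get_size = lambda ind: ind[2]
print(lexicographic_best(population, get_fitness, get_size))
print(parsimony_tournament(population, get_fitness, get_size, 7, random.Random(0)))
```

Here "b" beats "a" only because both have the same fitness and "b" is smaller; "c" is smaller still but loses on fitness, since size never overrides a fitness difference.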

Typically, the problems where Gaussian SVMs perform well are those where
the optimal C value for cost error is high, suggesting that the noise level in these
datasets is high too. Indeed, the fitness case subset selection embedded in EKM
might favor the selection of noisy examples, as those are more challenging to GP
kernels. A more progressive selection mechanism, taking into account all kernels
in the GP population to better filter out noisy examples and outliers, will be
considered in further research.

Table 4. Comparative 10-fold results of k-NN, Gaussian SVM and EKM on
the UCI data sets, with optimal settings (k and scaling for k-NN, C for SVM).
The reported test error is averaged over the 10 folds. For each fold tested with
the EKM, the 5 solutions out of 10 runs with best training error are assessed
on the test set, and their error is averaged. Test error rates in bold denote
the statistically best results according to a 95% two-tailed paired Student's t-test.
The "Average rank" column gives the test error ranking obtained for EKM compared
to k-NN and SVM, averaged over the 10 folds.

          k-NN                                 SVM                             EKM
Data set  k  Scaling  Train error  Test error  Best C  Train error  Test error  Train error  Best-half test error  Mean size  Average rank
--------  -  -------  -----------  ----------  ------  -----------  ----------  -----------  --------------------  ---------  ------------
bcw       5  No       0.027        0.025       1       0.030        0.028       0.020        0.030                 167        2.1
bld       5  No       0.336        0.353       1       0.329        0.325       0.299        0.309                 158        1.5
bos       1  Yes      0.248        0.235       0.001   0.224        0.308       0.253        0.281                 116        1.8
cmc       5  No       0.491        0.486       10      0.273        0.433       0.479        0.487                 129        2.4
ion       1  Yes      0.134        0.134       100     0.070        0.071       0.078        0.095                 156        1.9
pid       5  Yes      0.265        0.255       0.001   0.315        0.307       0.237        0.252                 145        1.45

k-NN outperforms SVM and EKM on the bos problem, where the noise
level appears to be very low. Indeed, the optimal value for the number k of
nearest neighbors is k = 1, while the optimal cost error is $10^{-3}$, suggesting
that the error rate is also low. Still, the fact that the error rate is close to
23% might be explained by the target concept being complex and/or many examples
lying close to its frontier. On bcw, the differences between the three algorithms
are not statistically significant and the test error rate is about 2%, suggesting
that the problem is rather easy.

EKM is found to outperform the other algorithms on bld, demonstrating that
kernel-based dissimilarity can improve on Euclidean distance with and without
rescaling. Last, EKM behaves like k-NN on the pid problem. Further, it must
be noted that EKM classifies the test examples using a 50-example prototype
set, whereas k-NN uses the whole training set (above 300 examples in the bld
problem and 690 in the pid problem).

As the well-known No Free Lunch theorem applies to Machine Learning too, 
no learning method is expected to be universally competent. Rather, the above 
experimental validation demonstrates that the GP-evolved kernels can improve 
on standard kernels in some cases. 



7 Related Works 



The most relevant work to EKM is the Genetic Kernel Support Vector Machine
(GK-SVM) [13]. GK-SVM similarly uses GP within an SVM-based approach,
with two main differences compared to EKM. On one hand, GK-SVM focuses
on feature construction, using GP to optimize the mapping $\Phi$ (instead of the
kernel). On the other hand, the fitness function used in GK-SVM suffers from a
quadratic complexity in the number of training examples. Accordingly, all
datasets but one considered in the experiments are small (fewer than 200
examples). On a larger dataset, the authors acknowledge that their approach does
not improve on a standard SVM with well-chosen parameters. Another related work
similarly uses GP for feature construction, in order to classify time series
[14]. The set of features (GP trees) is further evolved using a GA, where the
fitness function is based on the accuracy of an SVM classifier. Most other works
related to evolutionary optimization within SVMs (see [15]) actually focus on
parametric optimization, e.g. achieving feature selection or tuning some
parameters.

Another related work is proposed by Weinberger et al. [16], optimizing a
Mahalanobis distance based on the k-NN margin criterion inspired by [3] and
also used in EKM. However, being restricted to linear changes of representation,
the optimization problem is tackled by semi-definite programming in [16]. Lastly,
EKM is also inspired by the Dynamic Subset Selection first proposed by
Gathercole and Ross [17] and further developed by [18] to address scalability
issues in EC-based Machine Learning.

8 Conclusion 

The Evolutionary Kernel Machine proposed in this paper aims to improve kernel- 
based nearest neighbor classification [5], combining two original aspects. First, 
EKM implicitly addresses the feature construction problem by designing a new 
representation of the application domain better suited to the dataset at hand. 
However, in contrast with [13,14], EKM takes advantage of the kernel trick,
using GP to optimize the kernel function. Secondly, EKM proposes a co-evolution
framework to ensure the scalability of the approach and control the
computational complexity of the fitness computation. The empirical validation
demonstrates that this new approach is competitive with well-founded learning
algorithms such as SVM and k-NN using tuned hyper-parameters.

A limitation of the approach, also observed in the well-known boosting algo- 
rithm [9] , is that the competitive co-evolution of kernels and examples tends to 
favor noisy validation examples. A perspective for further research is to exploit 
the evolution archive, to estimate the probability for an example to be noisy and 
achieve a sensitivity analysis. Another perspective is to incorporate ensemble 
learning, typically bagging and boosting, within EKM. Indeed the diversity of 
the solutions constructed along population-based optimization enables ensemble 
learning almost for free. 